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TECHNICAL  PROGRESS  REPORT 

NUMBER  5 

Topic  Number:  SDIO  88-10 

Title:  Three  Dimensional  Cellular  Automata  for  Subpixel  Target  Detection 

Contract  Number:  N00014-88-C-0717 

From:  Kensal  Consulting,  Tucson,  Arizona  (Code:  0D9C9) 

To:  Dr.  Keith  Bromley,  NOSC,  San  Diego  (Code:  N00014) 

Project  Description: 

This  project  on  subpixel  target  detection  relates  to  research  in  the  optimization 
of  three-dimensional  computing  structures  for  use  in  target  detection  and  to 
research  in  the  reduction  of  an  optimum  computing  structure  to  an 
efficiently-designed  silicon  chip. 

Technical  Progress: 

During  February  research  continued  furthering  the  analysis  first  reported  in 
Technical  Progress  Report  Number  1  and  Technical  Progress  Report  Number  4, 
namely,  the  mathematical  optimization  of  device  design  assuming  a  planar  structure 
for  executing  cellular  logic  transforms.  The  optimization  criterion  is  the 
maximization  of  pixops  (picture  point  operations)  per  device.  As  reported  in 
Technical  Progress  Report  Number  4,  it  has  been  discovered  that  using  a  variable 
on-chip  data  window  size  improves  performance.  However,  in  our  previous  work, 
equations  were  not  developed  which  showed  clearly  the  basis  of  the  optimum 
conditions  discovered,  although  the  optima  were  clearly  defined.  This  has  now  been 
rectified  and  a  final  analysis  of  tv;c-^imeusior.aI  (planar)  structures  has  been 
completed.  The  results  are  dramatic.  There  is  a  well-defined  optimum  design 


which  had  not  heretofore  been  recognized.  The  results  of  this  research  will  have 
a  major  impact  assuming  that  prototyping  of  devices  for  cellular  logic  transforms  is 
funded  in  response  to  our  Phase  II  Proposal  (to  be  prepared  in  April).  Results  are 
given  below. 

Two-Dimensional  Case 

Assume  that  one-bit  pixels  in  a  square  data  field  of  span  S  must  be  processed 
using  a  neighborhood  processing  device  with  a  small  on-chip  data  memory.  This 
device  is  assumed  to  receive  3  pixels  (bite)  each  clock  cycle  t  from  the  data  field. 
It  is  also  assumed  to  contain  B  LUTs  (LookUp  Tables)  capable  of  flash  processing 
all  B  pixels  (bits)  simultaneously  and  returning  them  to  the  data  field  in  another, 
non-overlapping  clock  cycle  t.  Finally,  it  is  assumed  that  the  on-chip  data  window 
memory  is  capable  of  storing  M  pixels  (bits).  Initially  multiple  redundancy  in  M  is 
not  assumed.  This  will  be  treated  later.  It  is  assumed,  however,  that  M  can  be 
configured  as  a  matrix  with  an  arbitrary  number  of  rows  R  and  columns  C  =  M/R. 

The  purpose  of  the  following  analysis  is  to  determine  the  minimum  processing 
time  for  a  specific  value  of  M.  Then  to  apply  this  determination  to  an  analysis 
which  finds  the  value  of  M  which  yields  the  maximum  number  of  pixels  processed 
per  unit  time  per  on-chip  transistor.  From  this  the  best  processing  device  design 
can  be  found. 

The  time  T  to  fill  M  is  given  by 

Tl  =  Mt/B  (1) 

In  this  time  every  B-bit  entity  must  have  available  the  eight  other  B-Bit  entities 
that  form  the  neighborhoods  of  all  B  bits.  For  example,  if  B=l,  then  at  least  eight 
other  one-bit  entities  must  be  stored  in  M  to  form  a  single  3x3  neighborhood.  In 
this  case  R=3,  C=3,  and  M=9.  Conversely  for  B=16  (halfword  transfer)  at  least 
eight  other  16-bit  entities  must  be  stored  in  M  so  that  B  bits  may  be  processed 
simultaneously.  This  yields  a  minimum  value  of  M  =  144,  R=3,  and  C=48.  Thus,  in 
general  the  number  of  pixels  processed  per  round-trip  cycle  through  M  are  given  by 

N=(R-2)(C-2B)  =  (R-2XM/R-2B)  (2) 

The  processed  values  of  these  pixels  are  loaded  in  a  time  7^  given  by 


T2  =  (R-2)(M/R-2B)t/B  (3) 

Adding  equations  (1)  and  (3)  provides  the  round  trip  processing  time  as 

TQ  =  (M+(R-2)(M/R-2B))t/B  (4) 

O 

The  number  of  round-trip  cycles  required  is  simply  S“/N  so  that  the  total 
processing  time  T  is  given  by 

T  =  Tq(S2/N) 

=  S2(M+(R-2)(M/R-2B))t/B(R-2)(M/R-2B) 

=  S2Mt/B(R-2)(M/R-2B)  +  S2t/B  (5) 

For  a  given  design,  M,  B,  and  t  are  fixed  and,  for  a  given  processing  task,  S  is 
fixed.  Under  these  assumptions  the  second  term  in  equation  (5)  is  a  constant  as  is 
the  numerator  in  the  first  term.  Thus,  to  minimize  T,  for  a  given  task  and  design, 
the  denominator  of  the  first  term  in  equation  (5)  must  be  maximized.  This 
denominator  may  be  expressed  in  terms  of  a  constant  A  given  by 

A  =  BM  +  4B2  (6) 


and  a  function  of  R  given  by 


f(R)  =  -  2BM/R  -  2B2R  (7) 

differentiating  f(R),  setting  the  result  equal  to  zero,  and  solving  for  R  yields 

R  =  (M/B)  ^  (8) 

Substituting  this  result  in  (5)  yields  the  minimum  processing  time  as  follows 

Tmin(M,B,S,t)  =  S2Mt/B((M/B^-2)((MB^-2B)+S2t/B  (9) 

Letting  S=512  and  t  =  100ns,  equation  (9)  is  plotted  in  Figures  1-4  as  a  function  of 
M  for  B=l,  4,  8,  16,  respectively,  corresponding  to  bit,  nibble,  byte,  and  halfword 


3 


entities.  Examination  of  these  figures  shows  that  the  processing  time  for  the 

optimum  configuration  of  M  (equation  (8))  falls  rapidly  only  over  a  relatively  small 

2 

range  of  M  and  reaches  asymtotic  value  given  by  2S  t/B.  This  indicates  that 
halfword  (B  =  16)  flash  processors  using  large  values  of  M,  such  as  the  PHP  of 
Carnegie-Mellon  (M  -  65536)  are  wasteful  of  transistors.  As  shown  in  Figure  4, 
little  is  gained  in  such  designs  by  increasing  M  much  beyond  1024.  Also  note  that, 
for  M  -  1024,  the  optimum  configuration  of  M  is  highly  rectangular,  namely,  8  rows 
by  128  columns. 

Now  consider  optimization  in  terms  of  the  number  of  pixops  (pixel  operations) 
per  unit  time  per  transistor.  Using  triple  redundancy  in  M  for  maximum  speed 
(U.S.  Patent  4,641,351  ,  K.  Preston  Jr.),  the  total  number  of  transistors  is  given 

by 


D  =  12M  +  BL  (10) 

where  L  is  the  number  of  transistors  per  LUT  and  it  is  assumed  that  four 
transistors  are  required  per  memory  cell.  For  the  3x3  neighborhood  flash 
processor  considered  here  L  =  4x512  =  2048.  Pixops  per  unit  time  per  transistor 
are  given  by 

P  =  S2/TD  (11) 

ComDining  equations  (9),  (10),  and  (11)  yields 

P  =  B((M/B)  -2)((M Bp-2B)/(12M+2048B)iVlt^+  B/(12M+2048B)t  (12) 

Equation  (12)  is  plotted  in  Figuresl-4afor  B=l,  4,  8,  16,  respectively,  using  t  = 
100ns.  These  plots  correspond  to  Figures  1-4  and  furnish  a  graphical  solution  to 
the  maximization  of  equation  (12)  as  a  function  of  VI.  The  optimum  values  of  VI 
found  are  given  in  the  below  table 

B  Optimum  M 

1  ~  49 

4  '  225 

8  '441 

16  '  900 
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It  is  worth  noting  that  these  design  optima  are  independent  of  the  span  of  the  data 
field  S.  This  again  points  up  the  fallacy  that  large  values  of  M  are  required  when 
processing  large  data  fields.  In  fact,  as  M  increases  beyond  the  optima  given 
above,  efficiency  falls,  i.e.,  transistors  are  being  wasted. 

Preliminary  Design  Analysis 

The  appendix  to  this  technical  progress  report  has  been  submitted  by  Visual 
Information  Technologies  (Texas)  as  part  of  their  subcontracted  research  on  chip 
design.  This  analysis  has  benefitted  from  the  research  conducted  by  Kensal 
Consulting  in  that  the  research  at  Visual  Information  Technologies  has  been 
conducted  in  close  coordination  with  our  own. 


Plans  for  Contract  Completion: 

Plans  for  contract  completion  include  (1)  the  extension  of  our  two-dimensional 
analysis  to  the  final  three-dimensional  structure  for  use  in  target  track  detection 
and  (2)  completion  of  the  design  analysis  being  conducted  at  Visual  Information 
Technologies.  Also  it  is  likely  that  the  software  emulator  being  built  on  the 
Macintosh  II  will  be  finished  and  some  preliminary  results  furnished  (see  Technical 
Progress  Report  Number  3).  At  the  same  time  the  Phase  II  proposal  will  be 
outlined  in  preparation  for  its  submission  in  April  1989. 
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Data  Window  Memory 
Figure  4A  -  Word  Data  Transfer 
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Introduction 

The  purpose  of  this  report  is  to  outline  progress  that  has  been  made  in  the 
understanding  of  a  chip  implementation  of  a  3-dimensional  image 
morphology  processor.  It  is  the  intent  of  the  research  to  outline  the 
trade-offs  associated  with  this  chip  development.  This  research  is 
conducted  in  association  with  Kensal  Consulting  concerning  a  sub-pixel 
target  detection  system,  and  is  funded  by  an  SBIR  grant  from  the  SDIO. 
VITec  (Visual  Information  Technologies,  Inc.)  is  uniquely  qualified  to 
participate  in  this  research  because  of  its  background  in  VLSI  based  image 
processing  [1 ,2]. 

Background 

A  thorough  understanding  of  the  3D  morphology  processing  has  been 
attained.  Gaining  fluency  with  the  underlying  concepts  is  essential  in 
understanding  the  appropriate  tradeoffs  to  be  made  in  the  VLSI 
implementation  of  a  system  component. 

Current  image  processing  techniques  typically  involve  processing 
intensity  values  represented  by  integers  with  8  to  16  bits  of  precision. 
These  intensity  values  are  samples  taken  at  points  in  a  rectangularly 
tesselated  space,  which  is  a  generalization  of  traditional  1 D  signal 
processing  concepts  to  two  or  more  dimensions.  Examples  of  machines 
performing  image  processing  in  this  manner  are  the  VITec  Image 
Computer,  the  Pixar  Image  Computer,  as  well  as  systems  from  Vicom  and 
Gould.  A  2D  neighborhood,  or  kernal,  would  consist  of  9  pixels,  while  a  3D 
kernal  would  consist  of  27  voxels. 

In  this  study,  pixels  are  binary  variables,  having  been  processed  by  a 
thresholding  circuit.  Furthermore,  these  voxels  are  samples  from  a 
hexahedrally  tessellated  space,  such  that  each  sample  is  equidistant  from 
all  of  its  neighbors.  This  tessellation  has  a  physical  analogy  to  the  cubic 
close-packed  crystal  lattice  [3].  In  three  dimensions,  the  hexahedral 
tessellation  leads  to  a  neighborhood  of  13  voxels,  the  outer  twelve  of 


which  form  the  vertices  of  a  tetradecahedron,  a  polyhedra  having  14  faces. 

Since  each  pixel  is  a  single-bit  value,  it  can  be  regarded  as  a  binary 
variable,  which  leads  to  the  concept  of  a  "logical  transform"  of  a  3D 
neighborhood.  In  the  case  of  the  proposed  chip,  a  structure  would  be  built 
whose  inputs  are  the  13  neighborhood  voxels  of  the  hexahedrally 
tessellated  space,  and  whose  single  output  is  an  arbitrary  logical  function 
of  those  1 3  inputs.  This  structure  directly  implements  a  truth  table  with 
2**13,  or  8192,  entries,  and  corresponds  to  a  8Kx1  RAM  (Random  Access 
Memory)  which  serves  as  a  LUT  (LookUp  Table).  It  is  clear  that  the 
hexahedrally  tessellated  space  is  particularly  suitable  for  this  form  of 
processing,  since  a  3D  neighborhood  of  a  rectangularly  tessellated  space 
would  have  27  voxels,  requiring  a  LUT  having  2**27  entries.  Once  this 
basis  is  established,  the  following  architectural  issues  are  apparent: 

•  Should  there  be  parallel  LUTs  on  a  single  chip 

•  Is  it  necessary  to  implement  an  arbitrary  logical  function  of  13 
variables.  Research  indicates  that  most  logical  transforms 
already  in  practice  may  be  described  as  a  composite  function  of 
the  an  arbitrary  function  of  the  12  outer  voxels  and  the  center 
voxel  (described  further  below)  [4], 

•  How  is  image  data  efficiently  transfered  as  13  neighborhood 
voxels  to  the  LUT. 

•  How  is  result  data  returned  once  it  is  processed  by  the  LUT. 

•  Are  there  any  constraints,  such  as  cost,  space,  rad-hard,  power, 
maximum  transistor  count,  etc.,  or  should  all  of  these  be 
"sensibly  minimized". 

There  are  several  implementation  issues  to  be  studied  as  well.  These 
include: 

•  What  is  the  cheapest,  most  reliable  and  most  well  understood 
semiconductor  process  that  the  performance  and  cost 
requirements  will  allow. 

•  What  package  should  be  used. 

•  How  should  the  RAM  be  designed  (for  example,  should  a 
3-transistor  dynamic  cell  or  a  6-transistor  static  cell  be  used). 

•  What  integrated  circuit  design  methodology  should  be  used. 

Architectural  Studies 

A  proposed  chip  architecture  is  shown  in  figures  la  and  1b,  which  fits 
well  into  a  system  proposal  by  Kensal  Consulting.  In  the  processor,  data 
is  loaded  into  shift  registers,  once  per  cycle.  Although  figure  one  shows 
data  being  loaded  8b  at  a  time,  the  exact  number  of  bits  loaded  at  a  time 


is  a  parameter  which  needs  further  study.  There  are  seven  shift  registers, 
corresponding  to  the  7  voxels  of  a  hexagonal  neighborhood  in  a  particular 
frame.  Three  frames  are  needed  to  construct  a  13  voxel  3D  neighborhood, 
however  the  top  and  bottom  3  voxels  in  the  3D  neighborhood  occupy  3  of 
the  7  bit  positions  of  the  hexagon  (for  example,  see  [3],  page  54).  Once 
these  7  shift  registers  are  loaded,  processing  may  begin.  In  the  8b 
example  of  figures  la  and  1b,  8  neighborhoods  are  passed  to  the  single 
LUT,  one  at  a  time  for  8  cycles,  where  the  LUT’s  output  fills  an  output 
shift  register.  The  shift  register’s  contents  are  then  outputted  in  parallel 
to  an  image  memory.  There  is  a  3  cycle  latency  in  getting  the  pipeline 
started. 

The  difference  between  figures  la  and  1b  is  that  the  LUT  in  figure  1b  has 
been  halved  in  size,  and  its  output  is  processed  by  the  center  voxel  logic 
to  form  a  composite  logical  function.  This  center  voxel  logic  is  shown  in 
figure  2.  As  can  be  seen  in  figure  2,  a  4b  control  field,  which  defines  any 
one  of  the  possible  16  logical  functions  of  two  variables,  defines  the 
logical  function  of  the  LUT  output  and  the  center  voxel.  It  should  be 
stressed  that  this  composite  processing  does  not  replace  the  general 
function  of  an  8K  LUT,  but  from  known  morphological  logical  transforms, 
complete  generality  is  not  required,  and  this  assumption  halves  the  size 
of  the  largest  chip  component:  the  LUT  RAM. 

Further  study  will  be  performed  on  the  other  architectural  issues,  and 
summarized  in  a  final  report  at  the  conclusion  of  this  study. 

Chip  Implementation  Studies 

Although  the  processing  bandwidth  of  the  chip  is  presently  undefined,  it  is 
likely  that  a  standard  ASIC  CMOS  process  will  be  adequate  to  implement 
this  chip.  Presently,  most  CMOS  integrated  circuit  design  is  being 
conducted  using  Ip  or  less  minimum  photolithographic  design  rules,  which 
allows  for  extremely  high  packing  densities.  For  example,  if  one  were  to 
examine  the  1989  International  Solid  State  Circuit  Conference  Digest  of 
Technical  Papers,  he  would  find  that  most  papers  involving  digital 
circuitry  use  Ip  processes,  while  simultaneously,  chip  sizes  continue  to 
increase  dramatically.  For  example,  a  1  million  transistor  microprocessor 
was  presented  (by  Intel)  which  had  a  die  area  of  150  square  millimeters. 

By  the  time  that  an  implementation  of  this  chip  is  started,  Ip  processing 
will  be  commonplace. 

CMOS  processing  is  preferred  because  of  its  low  power  dissipation,  high 
speed,  proliferation  of  use,  and  insensitivity  to  noise.  CMOS  performance 
is  typically  adequate  for  all  but  the  most  demanding  performance 
requirements. 


The  chip  requirements  here  are  much  less  demanding.  Consider  the  area  of 
a  standard  6-transistor  fully  static  RAM  cell  (the  reasons  for  selecting 
this  cell  design  will  be  presented  in  the  final  report),  which  we  have 
designed  and  is  13.2p  x  21  .6ji,  using  Iji  design  rules  as  published  by  VLSI 
Technology,  Inc.  Assuming  30%  overhead  for  RAM  decoding,  sense  and 
write  circuitry,  the  RAM  area  is  only  3.0  square  millimeters.  This  area  is 
extremely  small,  and  should  result  in  very  low  costs.  Using  this  process, 
it  is  certainly  possible  to  design  a  LUT  which  could  be  cycled  at  more  than 
50  MHz. 

Thus,  with  the  semiconductor  processing  currently  available  today,  the 
research  should  not  be  too  constrained  by  chip  area.  Process  issues  will 
be  further  refined  in  the  final  report. 

Conclusion 

Further  research  should  be  conducted  to  improve  on  the  availability  of 
data  to  the  LUT,  so  that  it  is  always  kept  busy,  perhaps  adding  complexity 
to  the  way  data  is  brought  into  the  proposed  chip.  Further  study  should  be 
conducted  along  the  lines  of  parallel  LUTs  within  the  chip.  With  today's 
CMOS  processing  capability,  schemes  to  reduce  an  8K  LUT  to  a  4K  LUT  are 
not  worthwhile,  in  light  of  the  generality  which  is  sacrificed. 
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